Abstract

Algorithms for the dynamic programming and transitive closure problems
are presented for a linear pipeline of processors. These algorithms require
only a constant number of ports and are optimal in their area and
time requirements.

1.  Introduction


Dynamic programming and transitive closure are two important computational
problems. Dynamic programming is one of several widely used problem-solving
techniques in computer science and operations research (see Brown's review in [4]).
The transitive closure algorithm also arises in many contexts. For example, in the
data-flow analysis of programs, we often need the closure of the "call" relation.

Straightforward dynamic programming requires $O(n^3)$¹ sequential time, where $n$ is
the problem size. Similarly, well-known serial algorithms for transitive closure of an
$n \times n$ matrix require $O(n^3)$ time [19,20]. As matrix multiplication and transitive
closure are computationally equivalent [1], the time complexity of the transitive closure
algorithm can be further reduced by the methods of Pan [12]. However, the best known
upper bound on the time complexity for matrix multiplication is $O(n^{2.78})$ [12], which is
achieved at the expense of complicated code.

Parallel algorithms for these two problems have been studied in the past [6,11,17].
The best known upper bound on the parallel time complexity for these two problems is
$O(n)$, reported by Guibas et al. [6]. They use a systolic array of $O(n^2)$ processors.

Systolic arrays (see [9] for a description of systolic arrays) have been proposed as a
simple and effective means of employing VLSI technology to handle compute-bound
problems. These array processors are typically made up of simple, identical processing
elements (which we will refer to as cells from now on) that operate in synchrony. Several
array structures have been proposed, including linear arrays, rectangular arrays and
hexagonal arrays. High performance is achieved by extensive use of pipelining and
multiprocessing. In a typical application, such arrays would be attached as peripheral
devices to a host computer which inserts input values into them and extracts output
values from them.

¹ $f(n) = O(g(n))$ and $f(n) = \Omega(h(n))$ if there exist constants $c_1$ and $c_2$ such that
$f(n) \le c_1 g(n)$ and $f(n) \ge c_2 h(n)$ respectively.

In practice linear arrays are more attractive than two-dimensional arrays (like a
mesh and a hexagonal array) for several reasons. Among them are the following: Linear
arrays have bounded I/O requirements [9]. In a wafer containing faulty cells, a large
percentage of non-faulty cells can be efficiently reconfigured into a linear array [10].
Synchronization between cells in a linear array can be achieved by a simple global clock
whose rate is independent of the size of the array [5].

In this paper we present linear array algorithms for the dynamic programming and
transitive closure problems. Our algorithms use $O(n)$ cells and require $O(n^2)$ time steps
for dynamic programming problems of size $n$ and transitive closure of $n \times n$ matrices.
$O(n^2)$ time steps is optimal, as at least $n^2$ time steps are needed to insert the elements
into the array. Each cell in the array requires $O(n)$ storage (referred to as area in the
VLSI context). We will show that the $O(n^2)$ storage used in the array is optimal.

Parallel algorithms for these two problems that have appeared in the past are
vulnerable to failures in the cells and communication links of the parallel architectures
on which they run. Such failures are very likely in the systolic array solution proposed by
Guibas et al. Systolic arrays implemented in VLSI can have (with high probability) faulty
cells and links caused by production faults in the manufacturing process that result in
defects occurring randomly in the wafer [2].

Varman and Fussell [18] presented a technique to transform "one-way" pipelined
linear-array algorithms (that is, algorithms wherein elements in the linear array move
only from left to right) into equivalent algorithms on any connected component of
cells by configuring it into a logical linear array. Neighbouring processors need not be
physically adjacent in the connected component. The connected component of cells
could be formed by the non-faulty cells in an underlying network that has both faulty and
non-faulty cells and communication links. As we will see later on, our algorithms for
dynamic programming and transitive closure are one-way pipelined algorithms and hence
can be made robust by straightforward application of the technique in [18].

The  remainder  of  this  paper  is  organized  as  follows.  In  Sections  2  and  3  we  describe 
our  algorithms  for  dynamic  programming  and  transitive  closure  respectively.  In  the 
appendix  we  provide  proofs  of  correctness  of  these  algorithms  and  also  establish  the 
optimality  of  the  area  required  by  the  array. 

2.  Dynamic  Programming 

Many  problems  can  be  solved  by  the  use  of  dynamic  programming  techniques.  In 
order  to  describe  our  array  algorithm  without  excessive  generality,  we  will  focus  on  the 
construction  of  an  optimal  binary  search  tree  which  is  a  well-known  example  of  dynamic 
programming.  An  optimal  binary  search  tree  is  constructed  by  computing  the  following
recurrence  (see  Knuth  [8]  for  details): 

$$c(i,j) = w(i,j) + \min_{i<k<j} \{c(i,k)+c(k,j)\}, \quad 1 \le i < j \le n+1$$
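As a point of reference for the array algorithm, the recurrence can be evaluated sequentially. The following is a minimal sketch, not part of the original algorithm; the function name `optimal_cost` and the dictionary representation of `w` are assumptions made for illustration, and the empty minimum (when no `k` lies strictly between `i` and `j`) is taken to contribute nothing.

```python
from functools import lru_cache

def optimal_cost(w, n):
    """Evaluate c(i,j) = w(i,j) + min_{i<k<j} (c(i,k) + c(k,j)), 1 <= i < j <= n+1.

    `w` is a hypothetical dict mapping (i, j) to the weight w(i, j).
    When no k lies strictly between i and j, c(i,j) is simply w(i,j).
    """
    @lru_cache(maxsize=None)
    def c(i, j):
        # Candidate sums for every split point k strictly between i and j.
        splits = [c(i, k) + c(k, j) for k in range(i + 1, j)]
        return w[(i, j)] + (min(splits) if splits else 0)

    return {(i, j): c(i, j)
            for i in range(1, n + 2)
            for j in range(i + 1, n + 2)}
```

This direct evaluation takes $O(n^3)$ sequential time, which is the baseline the linear array improves on.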


We compute this recurrence on a linear array of $n$ cells. The array is comprised of
four data belts, $H_f$, $H_s$, $V_f$ and $V_s$; two control belts (each 1 bit wide), $H_c$ and $V_c$;
and an address belt $A_d$, as shown in Fig. 2.1 below.

Tokens are inserted into these belts at the input of cell 1 and emerge from the
output of cell $n$. The tokens stay on the same belts as they traverse the array. The tokens
travelling on $H_c$, $V_c$, $H_f$, $H_s$, $V_f$, $V_s$ and $A_d$ encounter a delay of 4, $2n+3$, 2, 4,
$2(n+1)$, $2(n+2)$ and 2 clock cycles respectively between any cell $i$ and cell $i+1$. These
delays can be implemented by shift registers. “ ” in the figure above denote shift registers
on a belt between cells. A token enters a cell from the left (its input) at the beginning of a
cycle and emerges from the right (its output) at the end of the cycle (possibly updated).
For example, tokens on $H_c$ enter cell 2 at a and leave at b. We will refer to the tokens at a
cell's input as its input tokens.

Each cell in the array has a local memory of size $n$. The operation of a cell in any
clock cycle is the following. Let $x$ be the contents of the address token at the cell's
input. The cell updates location $x$ in its local memory. The new value of location $x$ is
the minimum of the old value, the sum of the contents of its input tokens on $H_f$ and
$V_s$, and the sum of the contents of its input tokens on $H_s$ and $V_f$. If the control bit
is set in its input control token on belt $H_c$, then it changes the contents of its input
tokens on belts $H_f$ and $V_f$ to the updated value of location $x$. Lastly, if the control bit
is set in its input control token on belt $V_c$, then it changes the contents of its input
tokens on belts $H_s$ and $V_s$ to those of its input tokens on $H_f$ and $V_f$ respectively. The
linear-array algorithm then is the following.

1.  Store $w(i,j)$ in cell $j-i$ at location $n-i$.

2.  At cell 1 do the following:

a.  Insert a control token on $H_c$ with its control bit set at time $2kn+2$, $\forall k \ge 0$.

b.  Insert a control token on $V_c$ with its control bit set at time $2kn+1$, $\forall k \ge -n$.

c.  Insert an address token initialized to address $k$ on belt $A_d$ at time $2(kn+1+l)$,
$\forall k \ge 0$ and $\forall l$, $0 \le l < n$.
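The per-cycle cell operation described before the algorithm steps can be sketched as a small function. This is an illustrative sketch under stated assumptions, not the authors' implementation: the name `cell_step` is hypothetical, an empty belt slot is assumed to carry `float("inf")` so that it never wins the minimum, and the two control-bit actions are assumed to be applied in the order in which the text lists them.

```python
def cell_step(memory, addr, hf, hs, vf, vs, hc_set, vc_set):
    """One clock cycle of a dynamic-programming cell (hypothetical sketch).

    memory         -- the cell's local memory, updated in place
    addr           -- contents x of the input address token
    hf, hs, vf, vs -- contents of the input tokens on H_f, H_s, V_f, V_s
    hc_set, vc_set -- control bits of the input tokens on H_c and V_c
    Returns the (hf, hs, vf, vs) contents that leave the cell.
    """
    # Location x becomes the minimum of its old value and the two candidate sums.
    memory[addr] = min(memory[addr], hf + vs, hs + vf)
    if hc_set:
        # Control bit on H_c: put the updated value on both fast belts.
        hf = vf = memory[addr]
    if vc_set:
        # Control bit on V_c: copy the fast-belt tokens onto the slow belts.
        hs, vs = hf, vf
    return hf, hs, vf, vs
```

Representing an empty slot by infinity is only a convenience of the sketch; the text does not say how empty slots are encoded.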

This completes the description of the algorithm. The effect of the algorithm is the
following. Let $\delta = n-i$ and $\gamma = j-i$. Let $c(i,j)$ denote the token in location $\delta$ of
cell $\gamma$ that is initialized to $w(i,j)$ and eventually transferred onto $H_f$ and $V_f$.

$c(i,j)$ is computed and ready in cell $\gamma$ at time $2[\delta n+1+2(\gamma-1)]$. The cell then
starts transmitting $c(i,j)$ on both $H_f$ and $V_f$. $c(i,j)$ travels on $H_f$ for an additional
$2\gamma$ clock cycles and is then transferred onto $H_s$ at cell $2\gamma$. It remains on $H_s$ from
then on. Analogously, $c(i,j)$ travels on $V_f$ for an additional $2\gamma(n+1)$ clock cycles before
being transferred onto $V_s$ at cell $2\gamma$, whereupon it travels on $V_s$ from then on.

Example: Consider the computation of $c(1,5)$ where $n=4$.

Now $c(1,5) = w(1,5) + \min\{c(1,2)+c(2,5),\ c(1,3)+c(3,5),\ c(1,4)+c(4,5)\}$.

$c(1,3)$ and $c(3,5)$ are ready in cell 2 at time 30 and 14 respectively. $c(1,3)$ then
travels on $H_f$ for an additional 4 ($\gamma=2$) cycles and reaches cell 4 ($2\gamma=4$) at time 34.
$c(3,5)$ travels on $V_f$ for an additional 20 ($2\gamma n+2\gamma=20$) cycles and reaches cell 4 at the
same time. From step (2b) of the algorithm the control token inserted at time 1 reaches
cell 4 at time 34 (the delay on $V_c$ is $2n+3$ cycles/cell). So at time 34 $c(3,5)$ is on both
$V_f$ and $V_s$ and $c(1,3)$ is on both $H_f$ and $H_s$ (recall the cell operation when a control
token on $V_c$ is present at its input).

$c(2,5)$ is ready at time 26 in cell 3. It travels on $V_f$ from cell 3 and arrives at cell 4
at time 36 (the delay on $V_f$ is $2n+2=10$ cycles/cell). The case of $c(1,2)$ is interesting. It
is ready in cell 1 at time 26. It then travels an additional 2 clock cycles on $H_f$ till it
reaches cell 2 ($\gamma=1$) at time 28. It is then transferred onto $H_s$. It travels on $H_s$ for an
additional 8 cycles (the delay on $H_s$ is 4 cycles/cell) till it reaches cell 4 at time 36. So
at time 36 $c(1,2)$ and $c(2,5)$ arrive on $H_s$ and $V_f$ respectively at cell 4. Similarly it can
be verified that $c(1,4)$ and $c(4,5)$ also arrive at cell 4 on $H_f$ and $V_s$ respectively at
time 36.
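The arrival times in this example follow mechanically from the ready time $2[\delta n+1+2(\gamma-1)]$ and the belt delays, so they can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and it assumes, as described above, that a token rides the fast belt until cell $2\gamma$ and the slow belt afterwards.

```python
def ready_time(i, j, n):
    """Time at which c(i,j) is ready in cell gamma = j - i."""
    delta, gamma = n - i, j - i
    return 2 * (delta * n + 1 + 2 * (gamma - 1))

def _arrival(i, j, target, n, fast, slow):
    # The token starts in cell gamma, rides the fast belt up to cell 2*gamma,
    # and rides the slow belt for any cells beyond that.
    gamma = j - i
    switch = 2 * gamma
    t = ready_time(i, j, n)
    if target <= switch:
        return t + fast * (target - gamma)
    return t + fast * gamma + slow * (target - switch)

def arrival_on_h(i, j, target, n):
    """Arrival of c(i,j) at cell `target` via H_f (2/cell) then H_s (4/cell)."""
    return _arrival(i, j, target, n, fast=2, slow=4)

def arrival_on_v(i, j, target, n):
    """Arrival of c(i,j) at cell `target` via V_f (2(n+1)/cell) then V_s (2(n+2)/cell)."""
    return _arrival(i, j, target, n, fast=2 * (n + 1), slow=2 * (n + 2))

def meeting_times(i, j, n):
    """For each split k, the arrival times of c(i,k) and c(k,j) at cell j - i."""
    target = j - i
    return {k: (arrival_on_h(i, k, target, n), arrival_on_v(k, j, target, n))
            for k in range(i + 1, j)}

if __name__ == "__main__":
    print(meeting_times(1, 5, 4))  # {2: (36, 36), 3: (34, 34), 4: (36, 36)}
    print(ready_time(1, 5, 4))     # 38
```

Every pair of operands meets at cell 4 no later than time 38, the time at which $c(1,5)$ itself is transferred.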

3.  Transitive  Closure  Algorithm 

The transitive closure algorithm is the following (see [1] for details). Consider an
$n \times n$ matrix $A$ of 0's and 1's. This boolean matrix can represent a directed graph, if we
let the vertices of the graph be $1,2,\ldots,n$ and the element $a_{ij}$ of the matrix be 1 if there
is an edge from $i$ to $j$ and 0 otherwise. The transitive closure $A^*$ of $A$ is also a boolean
matrix where the $(i,j)$th entry (denoted as $a^*_{ij}$) is a 1 if and only if there is a directed
path from vertex $i$ to vertex $j$ in the graph. By definition every vertex has a path to itself.

Let $a_{ij}^{(k)}$ be 1 if and only if there is a $k$-path from vertex $i$ to vertex $j$, that is, a
path that passes through no vertex numbered higher than $k$ except the end points. The
transitive closure then can be evaluated using the following recurrence (see [1] for
details):

$$a_{ij}^{(k)} = a_{ij}^{(k-1)} \lor \left(a_{ik}^{(k-1)} \land a_{kj}^{(k-1)}\right), \quad 1 \le i,j,k \le n$$
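For comparison with the array algorithm, the recurrence corresponds to the familiar sequential (Warshall-style) evaluation. The sketch below is a plain serial reference, not the linear-array algorithm; the function name `transitive_closure` is an assumption, vertices are 0-indexed, and the diagonal is set to true because every vertex is defined to have a path to itself.

```python
def transitive_closure(a):
    """Sequential evaluation of a_ij^(k) = a_ij^(k-1) OR (a_ik^(k-1) AND a_kj^(k-1)).

    `a` is an n x n list of 0/1 (or bool) entries with 0-indexed vertices.
    Returns a new boolean matrix; the input is left unchanged.
    """
    n = len(a)
    # By definition every vertex has a path to itself.
    closure = [[bool(a[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):            # allow vertex k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                closure[i][j] = closure[i][j] or (closure[i][k] and closure[k][j])
    return closure
```

This takes $O(n^3)$ sequential time, matching the serial bound quoted in the introduction.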

We compute this recurrence on a linear array of $2n-1$ cells. The array is comprised
of two data belts, $H_f$ and $V_f$; two control belts (each 1 bit wide), $H_c$ and $V_c$; and an
address belt $A_d$, as shown in Fig. 3.1 below.


Figure  3-1 


Tokens are inserted into these belts at the input of cell 1 and emerge from the
output of cell $2n-1$. As in the algorithm for dynamic programming, these tokens stay on
the same belts as they traverse the array. The tokens travelling on $H_c$, $V_c$, $H_f$, $V_f$ and
$A_d$ encounter a delay of 1, $(n+1)$, 1, $(n+1)$ and 1 clock cycles respectively between any
cell $i$ and cell $i+1$. A token enters a cell from the left (its input) at the beginning of a
cycle and emerges from the right (its output) at the end of the cycle (possibly updated).

Each cell in the array has a local memory of size $n$. The operation of a cell in any
clock cycle is the following. Let $x$ be the contents of the address token at the cell's
input. The new value of location $x$ is the old value ORed with the AND of the contents of
its input tokens on $H_f$ and $V_f$. If the control bit is set in its input control token on belt
$H_c$, then it changes the contents of its input token on belt $V_f$ to the updated value of
location $x$, and if the control bit is set in its input control token on belt $V_c$, then it
changes the contents of its input token on belt $H_f$ to the same updated value.

Our linear array algorithm is a three-pass one. We use two copies of the matrix $A$.
Let $a_{ij}$ denote the $(i,j)$th entry in one copy and $\bar{a}_{ij}$ denote the same entry in the
other copy. Although initially $a_{ij}$ and $\bar{a}_{ij}$ are the same in both copies, these values
change as the algorithm progresses. $a_{ij}$ travels on $H_f$ and $\bar{a}_{ij}$ travels on $V_f$.

Let $c(i,j)$ denote the token in location $i$ of cell $i+j-1$. Let $t_s^p$ ($1 \le p \le 3$) denote
the time when pass $p$ begins. The linear array algorithm is the following.

1.  Begin the first pass at $t_s^1$, the second pass at $t_s^2 = t_s^1+(2n-1)(n+1)$ and the
third pass at time $t_s^3 = t_s^1+2(2n-1)(n+1)$.

2.  In every pass $p$ ($1 \le p \le 3$) do the following at cell 1.

a.  Insert $a_{ij}$ on $H_f$ at time $t_s^p+n(n-1)+n(i-1)+(j-1)$.

b.  Insert $\bar{a}_{ij}$ on $V_f$ at time $t_s^p+(n-j)n+(i-1)$.

c.  Insert a control token with its bit set on $V_c$ when $\bar{a}_{ii}$ is inserted on $V_f$.

d.  Insert a control token with its bit set on $H_c$ when $a_{ii}$ is inserted on $H_f$.

e.  Insert address $i$ on $A_d$ when $a_{ij}$ is inserted on $H_f$.

This completes the description of the algorithm. At the end of the three passes
$c(i,j)$ will have a 1 if and only if the transitive closure of matrix $A$ has a 1 in that
position.

Example:  Consider  the  graph  shown  in  Fig.  3.2  below  comprised  of  four  vertices. 


Figure  3-2 


We illustrate the computation of $a^*_{12}$. In pass 1, $a_{13}$ and $\bar{a}_{34}$ (which are both
initialized to 1) are inserted at times $t_s^1+14$ and $t_s^1+2$ respectively. $a_{13}$ and
$\bar{a}_{34}$ meet at cell 4 at time $t_s^1+17$ ($\bar{a}_{34}$ travels on $V_f$, which has a delay of 5
cycles/cell). So $c(1,4)$ in cell 4 is set to 1 at time $t_s^1+17$. $a_{14}$ is inserted at time
$t_s^1+15$. It reaches cell 4 at time $t_s^1+18$, whereupon it is set to 1.

In the second pass, $a_{14}$ and $\bar{a}_{42}$ (which are initialized to 1) are inserted at times
$t_s^2+15$ and $t_s^2+11$. They meet at cell 2 at time $t_s^2+16$, whereupon $c(1,2)$ is set to 1.
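The times quoted in this example come directly from the insertion schedule in steps (2a) and (2b) and the belt delays (1 cycle/cell on $H_f$, $n+1$ cycles/cell on $V_f$). The sketch below recomputes them relative to the start of a pass; the function names are hypothetical and all times are offsets from $t_s^p$.

```python
def insert_time_h(i, j, n):
    """Offset from the pass start at which a_ij is inserted on H_f (step 2a)."""
    return n * (n - 1) + n * (i - 1) + (j - 1)

def insert_time_v(i, j, n):
    """Offset from the pass start at which the second copy of a_ij is inserted on V_f (step 2b)."""
    return (n - j) * n + (i - 1)

def arrival_h(i, j, cell, n):
    """Arrival of a_ij at `cell` on H_f, which delays 1 cycle per cell."""
    return insert_time_h(i, j, n) + (cell - 1)

def arrival_v(i, j, cell, n):
    """Arrival of the second copy of a_ij at `cell` on V_f, which delays n+1 cycles per cell."""
    return insert_time_v(i, j, n) + (cell - 1) * (n + 1)

if __name__ == "__main__":
    n = 4
    # Pass 1: a_13 on H_f and the copy of a_34 on V_f meet at cell 1 + 4 - 1 = 4.
    print(arrival_h(1, 3, 4, n), arrival_v(3, 4, 4, n))  # 17 17
    # Pass 2: a_14 on H_f and the copy of a_42 on V_f meet at cell 1 + 2 - 1 = 2.
    print(arrival_h(1, 4, 2, n), arrival_v(4, 2, 2, n))  # 16 16
```

The same arithmetic shows that $a_{ik}$ and $\bar{a}_{kj}$ always reach cell $i+j-1$ together, which is the content of Lemma B.1 in the appendix.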


4.  Concluding Remarks

We have presented linear array algorithms for the dynamic programming and
transitive closure problems that are optimal in their area and time requirements. Our
algorithms are suitable for realization in VLSI. Using the technique in [18], our
algorithms can be made to run on several parallel architectures, like tree machines [15]
and mesh arrays, that have faulty cells.


Realizing the algorithms in VLSI raises some practical issues. In particular, we will
consider the potential mismatch between the size of the problem being solved ($k$) and
the size for which the chip is designed ($n$). If $k>n$, the problem can be partitioned into
blocks of size $n$ and the chip used iteratively to handle each block. An obvious solution
to handling the case when $k<n$ is to consider the problem of size $k$ as part of a bigger
problem of size $n$ (obtained by padding the problem of size $k$ with dummy elements).
This would however result in a time penalty factor of $(n/k)^2$ over that obtained by using
a chip of compatible size. An alternative approach is to configure the chip, as a
preprocessing step, to match the problem size. This would require decreasing the number
of cells configured to $k$ and decreasing the size of the buffer in each cell appropriately.
The selection of cells can be efficiently accomplished on a reconfigurable network such as
the CHiP [14]. Changing the buffer size requires the shift registers implementing the
buffers to have variable lengths, similar to the proposal in [3]. However, requiring the
shift register length to be continuously variable (that is, for all values of $k$ from 1 to $n$)
would be prohibitively expensive in terms of layout and area complexity. The algorithms
can be modified to run on $k$ cells without changing the buffer size (details omitted in this
paper). These modified algorithms have a time complexity of $O(nk)$ and hence this
results in a time penalty factor of $O(n/k)$. Let the buffer of size $N$ be divided into
$\alpha$ equal partitions, which can be tapped at $Nm/\alpha$, $\alpha \ge m \ge 1$. Then a problem of
size $k$, $N(m-1)/\alpha < k \le Nm/\alpha$, will employ a buffer of size $n = Nm/\alpha$. The time
penalty factor in such a case will be $O(Nm/(\alpha k))$. It is seen that this factor will never
exceed 2 when $m \ge 2$, since then $Nm/(\alpha k) < m/(m-1) \le 2$. This implies, for example,
that with just four partitions to the buffer, problems as small as 1/4th the original size
will incur a time penalty of at most a factor of 2. For most values of $k$, the penalty will
be even less, as illustrated in Fig. 4.1 below, which is a typical profile of the performance
degradation factor versus problem size for the case of the buffer split into four
partitions.
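The effect of tapping the buffer at a few fixed points can be tabulated with a short calculation. This is a rough sketch under stated assumptions: the function names are hypothetical, the number of partitions (called `alpha` here) is assumed to divide the buffer size `N` evenly, and the penalty is taken to be the plain ratio of the tapped buffer size to the problem size, ignoring any constant hidden in the $O(\cdot)$ bound.

```python
def tapped_buffer_size(k, N, alpha):
    """Smallest tap point N*m/alpha (1 <= m <= alpha) that is at least k.

    Assumes 1 <= k <= N and that alpha divides N.
    """
    m = -(-k * alpha // N)   # ceiling of k*alpha/N using integer arithmetic
    return N * m // alpha

def penalty_factor(k, N, alpha):
    """Ratio of the tapped buffer size to the problem size k."""
    return tapped_buffer_size(k, N, alpha) / k

if __name__ == "__main__":
    N, alpha = 64, 4
    for k in (17, 24, 32, 33, 48, 49, 64):
        print(k, tapped_buffer_size(k, N, alpha), round(penalty_factor(k, N, alpha), 2))
```

For any problem larger than one partition the ratio stays below 2, and it drops back to 1 whenever $k$ lands exactly on a tap point, which gives the sawtooth profile the text describes for four partitions.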


Figure 4-1


An  ideal  solution  to  small  problem  sizes  is  to  design  an  algorithm  on  an  array 
where  the  storage  in  any  cell  is  independent  of  the  problem  size.  Recently,  we  have  been 
able  to  do  this  for  matrix  multiplication  [13].  We  are  currently  investigating  algorithms 
for  both  these  problems  that  can  run  on  a  linear  array  where  the  storage  in  any  cell  can 
be  made  independent  of  the  problem  size. 
Appendix

We  now  provide  proofs  of  correctness  for  the  dynamic  programming  and  transitive 
closure  algorithms.  We  will  also  show  that  the  area  required  by  the  array  for  these  two 
algorithms  is  optimal.  In  the  proofs  to  follow,  in  any  reference  to  a  control  token  we 
will  assume  that  its  control  bit  is  set. 


A.  Proof  of  the  Dynamic  Programming  Algorithm 

We first establish that the algorithm described in Section 2 correctly computes
$c(1,n+1)$. Let $\gamma = j-i$ and $\delta = n-i$, $1 \le i < j \le n+1$. In the following Lemma we
establish the time at which $c(i,j)$ is transferred onto $H_f$ and $V_f$.


Lemma A.1: $c(i,j)$ is transferred onto $H_f$ and $V_f$ in cell $\gamma$ at time $2[\delta n+1+2(\gamma-1)]$.


Proof: $c(i,j)$ will be transferred onto $H_f$ and $V_f$ only if there is a control token present
on $H_c$ at cell $\gamma$'s input at time $2[\delta n+1+2(\gamma-1)]$. This means that this control token
must have been inserted on $H_c$ at cell 1 at time $2[\delta n+1+2(\gamma-1)]-4(\gamma-1) = 2[\delta n+1]$
(the elements on $H_c$ encounter a delay of 4 cycles/cell). By step (2a) of the algorithm, a
control token is inserted into the array on $H_c$ at time $2[kn+1]$ for every $k \ge 0$, and in
particular for $k = \delta$. □


Lemma A.2: (1) $c(i,j)$ travels on $H_f$ for $2\gamma$ cycles and is then transferred onto $H_s$ in
cell $2\gamma$, and (2) $c(i,j)$ travels on $V_f$ for $2(n+1)\gamma$ cycles and is then transferred onto
$V_s$ in cell $2\gamma$.


Proof: We will prove (1), as the proof for (2) can be established along similar lines. By
Lemma A.1, $c(i,j)$ is transferred onto $H_f$ at time $t_1 = 2[\delta n+1+2(\gamma-1)]$. In $2\gamma$
additional cycles it will reach cell $2\gamma$ (delay on $H_f$ is 2 cycles/cell).

In order for $c(i,j)$ to be transferred onto $H_s$ at cell $2\gamma$, it must meet a control token
on $V_c$ at the input of cell $2\gamma$ at time $t_1+2\gamma$. This means that this control token must
have been inserted into the array at time $t_3 = t_1+2\gamma-(2\gamma-1)[2(n+1)+1]$, where
$2(n+1)+1$ is the delay/cell encountered by control tokens on $V_c$. Substituting
$\delta = n-i$ and $\gamma = j-i$, $t_3$ reduces to $2n[n-2(j-i)+1-i]+1$. Now $2\gamma \le n$ as there are
only $n$ cells. So $2(j-i) \le n$. Also $i \le n$ and hence $[n-2(j-i)+1-i] \ge -n$. From step (2b)
of the algorithm, a control token is inserted into the array on $V_c$ at time $2kn+1$,
$\forall k \ge -n$. □


We are now ready to establish our main result about the correctness of computing
$c(i,j)$.


Theorem A.1: $c(i,j) = w_{ij} + \min_{i<k<j} \{c(i,k)+c(k,j)\}$ when it is transferred onto $H_f$
and $V_f$.


Proof: We prove this by induction on $\gamma$.

Basis. $\gamma = 1$. The correct value of $c(i,j)$ when $\gamma = 1$ is its initial value $w(i,j)$, which
is stored in location $\delta$ of cell 1. At time $2\delta n+2$, address $\delta$ and a control token are
inserted on the address belt and $H_c$ respectively. So $w(i,j)$ gets transferred onto $H_f$
and $V_f$.

Inductive Step. We have to show that the Theorem holds $\forall i'$ and $\forall j'$ such that
$j'-i' = \gamma+1$. Let $i' = i+\sigma-1$ and $j' = \sigma+j$. We will then have to show that

$$c(i+\sigma-1, \sigma+j) = \min_{i+\sigma-1<k<\sigma+j} \{c(i+\sigma-1,k)+c(k,\sigma+j)\}.$$

To show this we must show the following.

1.  $c(i+\sigma-1,k)$ and $c(k,\sigma+j)$ meet at cell $\gamma+1$ before $c(i+\sigma-1, \sigma+j)$ is
transferred, and

2.  when they meet, the address on the address belt at the input of cell $\gamma+1$ is
$n-i-\sigma+1$.


By the inductive hypothesis and Lemma A.1, $c(i+\sigma-1,k)$ is correctly computed when
it is transferred onto $H_f$ and $V_f$ at cell $k-i-\sigma+1$ at time
$t_1 = 2[(n-i-\sigma+1)n+1+2(k-i-\sigma)]$. It then travels on $H_f$ for an additional
$2(k-i-\sigma+1)$ cycles. Subsequently, it travels on $H_s$ till it reaches cell $\gamma+1$. Let $t_2$
denote the time taken to reach cell $\gamma+1$ after transfer. Now
$t_2 = \{2(k-i-\sigma+1)\}+[4(\gamma+1-2k+2i+2\sigma-2)]$. The expression within { } is the time it
travels on $H_f$ and that within [ ] is the time it travels on $H_s$. $t_2$ can be simplified to
$2[i+2j+3\sigma-3k-1]$. So,

$$t_1 + t_2 = 2[(n-i-\sigma+1)n+1+2(k-i-\sigma)-3k+3\sigma+i+2j-1]$$

$$= 2[(n-i-\sigma+1)n-k+\sigma-i+2j] \quad \ldots (*)$$

By the inductive hypothesis again, $c(k,\sigma+j)$ is correctly computed in cell $\sigma+j-k$
when it is transferred onto $H_f$ and $V_f$ at time $t_3 = 2[(n-k)n+1+2(\sigma+j-k-1)]$. It then
travels on $V_f$ till it reaches cell $\gamma+1$. Let $t_4$ denote this travel time. So
$t_4 = 2(n+1)(\gamma+1-\sigma-j+k)$ (recall that the delay/cell on $V_f$ is $2(n+1)$ clock cycles). Now
$t_3+t_4$ can be simplified to $2[(n-i-\sigma+1)n-k+\sigma-i+2j]$, which is the same as (*).
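The two simplifications above are routine but easy to get wrong by hand, so a brute-force numerical check over small values is a useful sanity test. The sketch below is illustrative only: the function names are hypothetical, and `s` stands for the shift parameter written $\sigma$ in the proof. It evaluates $t_1+t_2$, $t_3+t_4$ and the closed form (*) exactly as stated and compares them.

```python
def h_side(n, i, j, k, s):
    """t1 + t2: arrival of c(i+s-1, k) at cell gamma+1 via H_f and then H_s."""
    gamma = j - i
    t1 = 2 * ((n - i - s + 1) * n + 1 + 2 * (k - i - s))
    t2 = 2 * (k - i - s + 1) + 4 * (gamma + 1 - 2 * k + 2 * i + 2 * s - 2)
    return t1 + t2

def v_side(n, i, j, k, s):
    """t3 + t4: arrival of c(k, s+j) at cell gamma+1 via V_f."""
    gamma = j - i
    t3 = 2 * ((n - k) * n + 1 + 2 * (s + j - k - 1))
    t4 = 2 * (n + 1) * (gamma + 1 - s - j + k)
    return t3 + t4

def star(n, i, j, k, s):
    """The closed form marked (*) in the proof."""
    return 2 * ((n - i - s + 1) * n - k + s - i + 2 * j)

if __name__ == "__main__":
    small = range(1, 6)
    ok = all(h_side(n, i, j, k, s) == v_side(n, i, j, k, s) == star(n, i, j, k, s)
             for n in small for i in small for j in small
             for k in small for s in small)
    print(ok)  # True
```

Because both sides are polynomial identities in the five variables, agreement on a small grid is only supporting evidence; the algebra in the proof is what establishes the equality.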

We will next show that (*) $\le$ the time at which $c(i+\sigma-1, \sigma+j)$ is transferred onto
$H_f$ and $V_f$. By Lemma A.1, this time is $2[(n-i-\sigma+1)n+1+2(j-i)]$. We then have to show
that $2[(n-i-\sigma+1)n+1+2(j-i)] \ge 2[(n-i-\sigma+1)n-k+\sigma-i+2j]$, which reduces to showing
that $-k+\sigma-i+2j \le 1+2(j-i)$, and this is true as $i+\sigma-1<k<\sigma+j$.

In the proof we had assumed that $c(i+\sigma-1,k)$ travels on $H_f$ and $H_s$ whereas
$c(k,\sigma+j)$ travels on $V_f$ alone. We can also show in the symmetric case, where
$c(i+\sigma-1,k)$ travels on $H_f$ and $c(k,\sigma+j)$ travels on $V_f$ followed by $V_s$, that they
still meet at cell $\gamma+1$.

Lastly, we must show that the address on the address input at cell $\gamma+1$ is $n-i-\sigma+1$.
This address must have been inserted on the address belt $A_d$ of cell 1 at time (*)$-2(j-i)$,
which can be simplified to $2\{(n-i-\sigma+1)n+1+[\sigma+j-1-k]\}$. From step (2c) of the
algorithm, this address is $n-i-\sigma+1$ if $0 \le [\sigma+j-1-k] < n$. Now
$1 \le k < \sigma+j \le n+1$ and so $0 < \sigma+j-k$ and hence $0 \le \sigma+j-k-1$. Also
$\sigma+j \le n+1$ and hence $-k-1+\sigma+j < n$ as $k>0$. □


B.  Proof  of  the  Transitive  Closure  Algorithm 

We  establish  that  after  three  passes  the  c(i,j)’s  contain  the  transitive  closure  A*. 
Our  proof  is  along  similar  lines  to  the  proof  in  [16]  for  the  mesh-array  algorithm  in  [6]. 

Recall  that  a  k-path  from  vertex  i  to  vertex  j  denotes  a  path  from  i  to  j  that  goes 
through  no  vertex  numbered  higher  than  k  except  the  endpoints.  Consequently,  i 
and/or  j  may  exceed  k. 

In  the  proofs  that  follow,  the  expression  within  {  }  will  denote  the  time  at  which 
the  elements  are  inserted  in  the  array  and  that  within  [  ]  will  denote  the  time  it  takes  to 
reach  a  cell  after  insertion. 

Lemma B.1: In any pass $a_{ik}$ and $\bar{a}_{kj}$ meet at cell $i+j-1$.

Proof: $a_{ik}$ reaches cell $i+j-1$ at time $t_1 = \{t_s^p+n(n-1)+(i-1)n+k-1\}+[i+j-2]$.
Similarly $\bar{a}_{kj}$ reaches cell $i+j-1$ at time $t_2 = \{t_s^p+(n-j)n+(k-1)\}+[(i+j-2)(n+1)]$.
Now $\bar{a}_{kj}$ travels at a delay of $(n+1)$/cell on $V_f$ and hence $(i+j-2)$ is multiplied by a
factor $(n+1)$ in $t_2$. The expression in $t_2$ can be simplified to
$t_s^p+n(n-1)+(i-1)n+(k-1)+(i+j-2)$, which is the same as $t_1$. □

Lemma B.2: In any pass, (1) $a_{ij}$ and $\bar{a}_{jj}$ meet at cell $i+j-1$, and (2) $a_{ii}$ and
$\bar{a}_{ij}$ also meet at cell $i+j-1$.

Proof: We will prove (1); the proof for (2) is similar. Now $a_{ij}$ arrives at cell $i+j-1$ at
time $t_1 = \{t_s^p+n(n-1)+(i-1)n+j-1\}+[i+j-2]$ and $\bar{a}_{jj}$ arrives there at time
$t_2 = \{t_s^p+(n-j)n+(j-1)\}+[(i+j-2)(n+1)]$, which can be simplified and shown to be the
same as $t_1$. □


Corollary B.1: $a_{ij}$ and $\bar{a}_{ij}$ are updated to the value of $c(i,j)$ when they pass cell
$i+j-1$.


Proof: $a_{ij}$ and address $i$ are inserted at the same time in the array. They both travel at
the same speed and hence reach cell $i+j-1$ at the same time. A control token is inserted
on $V_c$ along with $\bar{a}_{jj}$. They both travel at the same speed and hence when $a_{ij}$ meets
$\bar{a}_{jj}$ it gets updated. A similar argument will prove that $\bar{a}_{ij}$ is also updated. □

Lemma B.3: In any pass, (1) $a_{ij}$ reaches cell $i+k-1$ at time
$t_s^p+n(n-1)+(i-1)n+(j-1)+(i+k-2)$, and (2) $\bar{a}_{ij}$ reaches cell $k+j-1$ at time
$t_s^p+n(n-1)+(k-1)n+(j-1)+(i+k-2)$.

Proof: Immediate from steps (2a) and (2b) of the algorithm. □

Lemma B.4: Suppose there is a min($i$,$j$)-path from $i$ to $j$, that is, a path that goes
through no vertex as high as its end points. Then on pass 1 of the algorithm, $a_{ij}$,
$\bar{a}_{ij}$ and $c(i,j)$ are all set to 1 at or before $a_{ij}$ and $\bar{a}_{ij}$ reach cell $i+j-1$.

Proof: We prove this by induction on the length of the shortest path from $i$ to $j$. For
the basis, paths of length 0 or 1, the Lemma holds as $a_{ij}$ and $\bar{a}_{ij}$ are 1 initially and
$c(i,j)$ is assigned 1 when $a_{ij}$ or $\bar{a}_{ij}$, whichever is earlier, reaches cell $i+j-1$.

For the induction, suppose there is a min($i$,$j$)-path of length two or more from $i$ to
$j$. Then there exists some other vertex on the path. Let $l$ be the highest numbered
vertex on this path other than the end points. Now $l<i$ and $l<j$ because the path is a
min($i$,$j$)-path. Since $l$ exceeds any other vertex on this path, there is a min($i$,$l$)-path
from $i$ to $l$ and a min($l$,$j$)-path from $l$ to $j$, and both of these paths are shorter than
the path from $i$ to $j$.

By the inductive hypothesis $a_{il}$ and $\bar{a}_{lj}$ are set to 1 at or before $a_{il}$ reaches cell
$i+l-1$ and $\bar{a}_{lj}$ reaches cell $l+j-1$ respectively. Let $t_1$ and $t_2$ be the times when
$a_{il}$ and $\bar{a}_{lj}$ reach cell $i+l-1$ and $l+j-1$ respectively. From Lemma B.3,
$t_1 = t_s^1+n(n-1)+(i-1)n+(l-1)+(i+l-2)$, and $t_2 = t_s^1+n(n-1)+(l-1)n+(l-1)+(l+j-2)$.

Let $t_3$ be the time at which they meet in cell $i+j-1$. Now
$t_3 = \{t_s^1+n(n-1)+(i-1)n+l-1\}+[i+j-2]$. As $i>l$ and $j>l$, $t_3>t_1$ and $t_3>t_2$. Recall
that address $i$ is inserted into the array along with $a_{il}$ (step 2(e) of the algorithm).
Consequently $c(i,j)$ is assigned 1.

Let $t_4$ be the minimum of the times taken by $a_{ij}$ and $\bar{a}_{ij}$ to reach cell $i+j-1$, and so
$t_4 = t_s^1+n(n-1)+(i-1)n+\min(i,j)-1+(i+j-2)$. $i>l$ and $j>l$ and so $t_4>t_3$. Hence $a_{ij}$
and $\bar{a}_{ij}$ are assigned 1 when they reach cell $i+j-1$. □

Lemma B.5: After pass 2 of the algorithm,

a.  If there is a $j$-path from $i$ to $j$, then $c(i,j)$ and $a_{ij}$ are set to 1 by time
$t_s^2+n(n-1)+(i-1)n+(j-1)+(i+j-2)$.

b.  If there is an $i$-path from $i$ to $j$, then $c(i,j)$ and $\bar{a}_{ij}$ are set to 1 by time
$t_s^2+n(n-1)+(i-1)n+(i-1)+(i+j-2)$.

c.  If there is a max($i$,$j$)-path from $i$ to $j$ then $c(i,j)$ is set to 1 at some time.

Proof: We prove this by induction on the path length. If the length is 1 then $a_{ij}$
($\bar{a}_{ij}$) must be 1 if there is a $j$-path ($i$-path) from $i$ to $j$. Hence $c(i,j)$ will be
assigned 1 when $a_{ij}$ or $\bar{a}_{ij}$ reaches cell $i+j-1$.

For the induction, suppose there is a $j$-path of length at least two from $i$ to $j$. Let
$l$ be the highest numbered vertex on the path. Then $l<j$ and there is a shorter $l$-path
from $i$ to $l$. By the inductive hypothesis, $a_{il}$ is set to 1 by time
$t_1 = t_s^2+n(n-1)+(i-1)n+(l-1)+(i+l-2)$. Since $l$ is chosen to be the highest numbered
vertex on the $j$-path, there is a min($l$,$j$)-path from $l$ to $j$. By Lemma B.4, $\bar{a}_{lj}$ is
already 1 by the end of pass 1. Thus at time $t_2 = t_s^2+n(n-1)+(i-1)n+(l-1)+(i+j-2)$, which
is later than $t_1$, $a_{il}$ and $\bar{a}_{lj}$ meet at cell $i+j-1$, at which time $c(i,j)$ is set to 1. It
can be easily verified that $a_{ij}$ and $\bar{a}_{jj}$ arrive at cell $i+j-1$ later than $t_2$. Hence
$a_{ij}$ too is assigned 1.

We have proved (a). A similar argument will establish (b) and these two together
imply (c). □


We  are  now  ready  to  establish  our  main  result. 


Theorem  B.l:  After  the  third  pass,  c(i,j)  is  set  to  1  if  there  is  any  path  from  i  to  j. 


Proof: By Lemma B.5, if there is a max(i,j)-path from i to j then $a_{ij}$ is already 1 after pass 2. Otherwise, the highest numbered vertex $l$ on some path from i to j is larger than either i or j. This means that there is an $l$-path from i to $l$ and an $l$-path from $l$ to j. The $l$-paths from i to $l$ and $l$ to j are a max(i,$l$)-path and a max($l$,j)-path respectively by the maximality of $l$. By Lemma B.5, $a_{il}$ and $a_{lj}$ are set to 1 by end of pass 2. They meet again in cell i+j-1 in pass 3, at which time c(i,j) is assigned 1. □
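
As a point of comparison for Theorem B.1, the following is a minimal serial sketch of transitive closure (Warshall's method, not the three-pass pipeline algorithm of this paper). It can serve as a reference when checking that c(i,j) ends up 1 exactly when some path from i to j exists. The function name `transitive_closure` and the 0-based adjacency-matrix representation are assumptions made for illustration.

```python
def transitive_closure(adj):
    """Serial reference: c[i][j] == 1 iff there is a path from i to j.

    adj is an n x n 0/1 adjacency matrix (0-based indices, an assumption
    of this sketch; the paper numbers vertices from 1).
    """
    n = len(adj)
    c = [row[:] for row in adj]          # work on a copy
    for l in range(n):                   # l: highest intermediate vertex allowed so far
        for i in range(n):
            if c[i][l]:
                for j in range(n):
                    if c[l][j]:
                        c[i][j] = 1      # path i -> l -> j
    return c
```

The outer loop over `l` mirrors the proof's device of splitting a path at its highest numbered vertex.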


C.  Area-Optimality  of  the  Dynamic  Programming  Algorithm 

The  recurrence  used  to  compute  the  dynamic  programming  problem  (see  Section  2) 
can  be  rewritten  as: 
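$c_{ij}^{(0)} = w_{ij}, \quad 1 \le i < j < n+1$
$c_{ij}^{(\mathrm{final})} = c_{ij}^{(0)} + \min_{i<k<j} \{ c_{ik}^{(\mathrm{final})} + c_{kj}^{(\mathrm{final})} \}$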
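
A minimal serial sketch of evaluating this recurrence follows. It assumes the recurrence has the form $c_{ij}^{(\mathrm{final})} = w_{ij} + \min_{i<k<j}\{c_{ik}^{(\mathrm{final})} + c_{kj}^{(\mathrm{final})}\}$, and it assumes that the minimum over an empty range of k (the case j = i+1) contributes 0. The function name `dp_final` and the dictionary representation of w are hypothetical.

```python
def dp_final(w, n):
    """Evaluate the recurrence serially in O(n^3) time.

    w maps (i, j) with 1 <= i < j <= n to the initial weight w_ij.
    Returns a dict mapping (i, j) to the final value of c_ij.
    """
    c = {}
    for span in range(1, n):                 # span = j - i, shortest intervals first
        for i in range(1, n - span + 1):
            j = i + span
            # Assumption of this sketch: an empty min (j == i + 1) counts as 0.
            best = min((c[(i, k)] + c[(k, j)] for k in range(i + 1, j)), default=0)
            c[(i, j)] = w[(i, j)] + best
    return c
```

Processing intervals in order of increasing span guarantees that every $c_{ik}$ and $c_{kj}$ is final before $c_{ij}$ is computed.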

We will establish that the area required by the linear array to compute the recurrence is asymptotically optimal. We establish this result under the following assumptions.

1.  Any special purpose machine (a chip in VLSI) that computes the value of $c_{ij}^{(\mathrm{final})}$ must compute $c_{ik}^{(\mathrm{final})}$, $c_{kj}^{(\mathrm{final})}$ ($\forall k$, $i < k < j$).

2.  The comparison and addition operations require non-zero time.

3.  The only input/output done by the machine is to read $w_{ij}$ and output $c_{ij}^{(\mathrm{final})}$ (that is, we do not allow partially updated $c_{ij}$ to leave the machine and re-enter at a later time).

Definition C.1: $c_{ij}$ is said to be assigned a value when either

a.  $w_{ij}$ enters the machine or

b.  $c_{ik}^{(\mathrm{final})} + c_{kj}^{(\mathrm{final})}$ has been computed for some k.

Under these assumptions we will establish that $\Omega(n^2)$ is a lower bound on the storage required by formulating the evaluation of the recurrence used to compute the dynamic programming problem as a game played with colored tokens on a graph G constructed as follows.

Let $G=(V,E)$ where $V=\{V_{ij} \mid 1 \le i < j \le n\}$, and $E=\{(V_{ij}, V_{i+1,j}) \mid 1 \le i < j-1 < n\} \cup \{(V_{ij}, V_{i,j+1}) \mid 1 \le i < j < n\}$.
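
The construction of G can be sketched as follows, assuming the reading $V=\{V_{ij} \mid 1 \le i < j \le n\}$ with edges from $V_{ij}$ to $V_{i+1,j}$ and to $V_{i,j+1}$ whenever those vertices exist. The function name `build_token_graph` and the tuple representation of vertices are hypothetical.

```python
def build_token_graph(n):
    """Build the token-game graph G = (V, E) for problem size n.

    A vertex V_ij is represented by the tuple (i, j) with 1 <= i < j <= n.
    """
    vertices = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
    edges = []
    for (i, j) in vertices:
        if i + 1 < j:                        # edge (V_ij, V_{i+1,j}) needs i+1 < j
            edges.append(((i, j), (i + 1, j)))
        if j + 1 <= n:                       # edge (V_ij, V_{i,j+1}) needs j+1 <= n
            edges.append(((i, j), (i, j + 1)))
    return vertices, edges
```

Under this reading the graph has n(n-1)/2 vertices, so it already has $\Theta(n^2)$ tokens; the lower-bound argument shows that a constant fraction of them must be grey at the same instant.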

Fig. C.1 below illustrates the graph for n=16.

Figure C-1

The  rules  of  the  game  are  as  follows. 

Initially  a  white  token  is  present  on  every  vertex  in  V. 

When $c_{ij}$ is first assigned a value in the machine, the token on $V_{ij}$ becomes grey in color.

When $c_{ij}^{(\mathrm{final})}$ leaves the machine, the token on $V_{ij}$ becomes black in color.

Once a token changes color it cannot return to the color it had earlier.

All tokens change color from white to grey and finally to black. The computation is over when the token on $V_{1n}$ becomes black.

Each token spends a non-zero amount of time when it is grey.


We  introduce  the  following  notations  which  will  be  used  in  the  proofs. 

Let $X \subseteq V$. A column of X is a subset of the vertices of X with the same first index. Similarly, a row of X is a subset of vertices of X with the same second index.

Let $X_w = \{V_{ij} \in X \mid \text{the token on } V_{ij} \text{ is white}\}$, $X_g = \{V_{ij} \in X \mid \text{the token on } V_{ij} \text{ is grey}\}$ and $X_b = \{V_{ij} \in X \mid \text{the token on } V_{ij} \text{ is black}\}$.

Let  A,  B  and  C  be  three  subsets  of  V  defined  as  follows 

a4v,j(V|  i<i<j,i+l<i<^|  B-{  V./I  l<i<i  an. 

Fig. C.2 illustrates the three subsets when n=16.

Figure C-2


For convenience, we will assume that n is a multiple of four. Let $t'$ denote the time at which $c_{n/4+1,\,3n/4}$ obtains its final value $c^{(\mathrm{final})}_{n/4+1,\,3n/4}$. Now choose $t < t'$ to be the time at which $c^{(\mathrm{final})}_{n/4+1,\,k}$ and $c^{(\mathrm{final})}_{k,\,3n/4}$ ($\forall k$, $n/4+1 < k < 3n/4$) have been computed but $c^{(\mathrm{final})}_{n/4+1,\,3n/4}$ has been partially computed (that is, final assignment has not yet been made). From the recurrence relation it is seen that such a time instant must occur by assumption 2.

We will obtain a lower bound on the number of grey tokens on vertices in V at time t. Since a grey token corresponds to a $c_{ij}$ value that is in the machine, by assuming that each such value requires unit storage, we will obtain the desired lower bound on the storage.

Lemma C.1: If the token on any $V_{ij} \in V$ is white then the token on any $V_{it}$ ($t<j$) and any $V_{sj}$ ($s>i$) must be either white or grey.

Proof: If the token on $V_{ij}$ is white then $c_{ij}$ has not been assigned a value. Computing $c_{ij}^{(\mathrm{final})}$ requires the values of $c_{it}^{(\mathrm{final})}$ ($\forall t<j$) and $c_{sj}^{(\mathrm{final})}$ ($\forall s>i$). Hence, none of these could have left the machine. Therefore, the tokens on $V_{it}$ ($t<j$) and $V_{sj}$ ($s>i$) cannot be black. □


Lemma C.2: At time t, (a) $C_b = \emptyset$ and (b) $A_w = \emptyset$.

Proof: (a) At time t, $c^{(\mathrm{final})}_{n/4+1,\,3n/4}$ has been partially computed. Suppose $V_{xy} \in C$ has a black token on it at time t. Since $c^{(\mathrm{final})}_{xy}$ requires $c^{(\mathrm{final})}_{n/4+1,\,y}$ for computation and $c^{(\mathrm{final})}_{n/4+1,\,y}$ requires $c^{(\mathrm{final})}_{n/4+1,\,3n/4}$ for computation, this implies that $c^{(\mathrm{final})}_{n/4+1,\,3n/4}$ has already been computed, which is a contradiction.

(b) At time t, $c^{(\mathrm{final})}_{s,\,3n/4}$ ($s > n/4+1$) has been computed. Suppose $V_{x,\,3n/4} \in A$ ($x > n/4+1$) had a white token on it. Then $c^{(\mathrm{final})}_{x,\,3n/4}$ could not have been computed, which is a contradiction. Since $c_{n/4+1,\,3n/4}$ has been assigned a value, the token on $V_{n/4+1,\,3n/4}$ must be grey. Finally, since $c^{(\mathrm{final})}_{n/4+1,\,t}$ ($\forall t < 3n/4$) have been computed, all $V_{n/4+1,\,t} \in A$ ($t < 3n/4$) must have a grey or black token. Hence $A_w = \emptyset$. □


Lemma C.3: If $|C_w| \ge n^2/32$ then $|B_g| + |B_w| \ge n^2/16$.

Proof: Since $|C_w| \ge n^2/32$, at least $n/8$ columns of C have a white token. (A column is said to have a white token if the token on at least one vertex in the column is white.) Then by Lemma C.1, at least $n/8$ columns of B must not have a black token. Thus at least $\frac{n}{8} \times \frac{n}{2} = \frac{n^2}{16}$ of the vertices in B have either grey or white tokens on them and hence $|B_g| + |B_w| \ge n^2/16$. □


Lemma C.4: If $|B_w| \ge n^2/32$ then $|A_w| + |A_g| \ge n^2/512$.

Proof: Since $|B_w| \ge n^2/32$, at least $n/16$ rows of B must have a white token. By Lemma C.1, at least $n/16$ rows of A must not have a black token. Thus, at least $\frac{1}{2} \times \frac{n}{16} \times \frac{n}{16} = \frac{n^2}{512}$ of the vertices in A must have grey or white tokens, that is $|A_w| + |A_g| \ge n^2/512$. □


Theorem C.1: At time t, the number N of grey tokens on vertices in V is $\Omega(n^2)$.

Proof: Since $|C| = n^2/16$, by Lemma C.2, it follows that at time t, $|C_g| + |C_w| = n^2/16$. Thus, at least one of the following must hold: $|C_g| \ge n^2/32$ or $|C_w| \ge n^2/32$. If $|C_g| \ge n^2/32$ then $N = \Omega(n^2)$. If $|C_w| \ge n^2/32$ then by Lemma C.3, $|B_g| + |B_w| \ge n^2/16$. Again, at least one of the following two conditions must hold: $|B_g| \ge n^2/32$ or $|B_w| \ge n^2/32$. If $|B_g| \ge n^2/32$ then $N = \Omega(n^2)$. If $|B_w| \ge n^2/32$ then by Lemma C.4, $|A_w| + |A_g| \ge n^2/512$. By Lemma C.2 however, at time t, $|A_w| = 0$ and thus $|A_g| \ge n^2/512$. Hence, in all cases $N = \Omega(n^2)$. □
D.  Area-Optimality  of  the  Transitive  Closure  Algorithm

We will now establish that the area required by the transitive closure algorithm is optimal. We obtain a lower bound on the storage required to compute matrix multiplication. As matrix multiplication and transitive closure are related [16], the lower bounds on the area are the same to within a constant factor.

Let $a_{ij}$, $b_{ij}$, and $c_{ij}$ denote the (i,j)th element in matrix A, matrix B and result matrix C respectively. We establish this result under the following assumptions:

1.  Any special-purpose machine (like a linear array) that multiplies matrices A and B must compute $a_{ik}b_{kj}$ ($\forall i$, $\forall j$ and $\forall k$, $1 \le i,j,k \le n$).

2.  The  special-purpose  machine  has  a  constant  number  of  I/O  ports. 


3.  The  elements  of  the  matrices  A,  B  and  C  are  inserted  into  the  special-purpose 
machine  only  once  through  the  input  ports. 

Under these assumptions we will establish that $\Omega(n^2)$ is a lower bound on the storage that is required by any special-purpose machine that multiplies two $n \times n$ matrices. We obtain this bound by formulating the computation of matrix multiplication as a game played with tokens on an undirected graph constructed as follows:

Let $G_k=(V_k, E_k)$, $k=1,..,n$ where

$V_k=\{f_{ik}, h_{kj} \mid i=1,..,n \text{ and } j=1,..,n\}$ and
$E_k=\{<f_{ik}, h_{kj}> \mid i=1,..,n \text{ and } j=1,..,n\}$

The  rules  of  the  game  are  as  follows: 

1.  A token is placed on $f_{ik}$ ($h_{kj}$) when $a_{ik}$ ($b_{kj}$) is inserted into the machine.

2.  Updating $c_{ij}$ (by adding $a_{ik}b_{kj}$ to $c_{ij}$ for some k) results in removing the edge $<f_{ik}, h_{kj}>$ from $G_k$.

3.  An  edge  is  removable  only  if  there  are  tokens  at  both  end  vertices. 

4.  A  token  from  a  vertex  is  removable  only  if  all  the  edges  incident  on  the  vertex  are 
removable.  When  a  token  from  a  vertex  is  removed  then  all  the  incident  edges  on 
the  vertex  are  deleted.  (The  token  will  eventually  leave  the  machine  and  will  never 
reenter.) 

We will assume that each token occupies unit storage (O(1)). We also assume that a partially updated $c_{ij}$ occupies unit storage. (At any instant of time $c_{ij}$ is partially updated if there exists some k ($1 \le k \le n$) such that $a_{ik}b_{kj}$ has not been computed, or has not been added to $c_{ij}$, by that time instant.)

Let $x_k$ be the earliest time at which the first token in $G_k$ is removable and let $y_k$ be the earliest time at which all the tokens in $G_k$ are removable. Since only a constant number of tokens enter the machine at any time, by choosing n sufficiently large, we can ensure that $x_k < y_k$ for all k ($1 \le k \le n$). For each k ($1 \le k \le n$), let $I_k = (x_k, y_k)$ denote the time interval between and including $x_k$ and $y_k$.


Lemma D.1: At any time t such that $x_k \le t < y_k$, there are at least n tokens in $G_k$.

Proof: Without any loss of generality, let the first (or one of the first if there are more than one) token(s) that can be removed from $G_k$ be the one on vertex $f_{mk}$. At $t = x_k$, then, there must be tokens on all $h_{kj}$ ($1 \le j \le n$). We claim that no token on any $h_{kj}$ will be removable at any t ($x_k \le t < y_k$).

Assume this is not the case, and at $t < y_k$, let $h_{kj}$ be the first vertex (or one of the first vertices) from which a token is removable. This implies that there must be tokens on all vertices $f_{ik}$ that still have incident edges. This means that all the edges still remaining in $G_k$ are removable, and consequently all the remaining tokens in $G_k$ are removable at time t. But then $t = y_k$, which is a contradiction. Hence no token on any $h_{kj}$ is removable at any time t ($x_k \le t < y_k$). Each $h_{kj}$ has a token and hence the Lemma.


Lemma D.2: Let $m<n$. For any j, if $t \ge y_j$ and $G_j$ has m tokens then at least $n^2/2$ edges must have been deleted from $G_j$.

Proof: There are m tokens in $G_j$. Since $t \ge y_j$, the absence of a token on a vertex means that all the n edges incident on the vertex have been deleted. (At $t = y_j$, all edges in $G_j$ are removable.) The number of absent tokens is $2n-m$, which is greater than n as $m<n$. Now one edge is in common with at most two vertices. Thus the $2n-m$ absent tokens result in at least $n^2/2$ deleted edges. □
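
The counting step in Lemma D.2 can be checked by brute force for small n. The sketch below assumes each $G_j$ is the complete bipartite graph on the vertex sets $\{f_{ij}\}$ and $\{h_{jk}\}$ (n vertices on each side, $n^2$ edges) and that a vertex without a token has lost all n of its incident edges. The names `deleted_edges` and `check_lemma_d2` are hypothetical.

```python
from itertools import combinations

def deleted_edges(n, absent):
    """Count edges of K_{n,n} that touch at least one token-free vertex."""
    return sum(1 for i in range(n) for j in range(n)
               if ('f', i) in absent or ('h', j) in absent)

def check_lemma_d2(n):
    """Check every placement of m < n remaining tokens on the 2n vertices."""
    vertices = [('f', i) for i in range(n)] + [('h', j) for j in range(n)]
    for m in range(n):                               # m < n tokens remain
        for present in combinations(vertices, m):
            absent = set(vertices) - set(present)    # 2n - m > n absent tokens
            if 2 * deleted_edges(n, absent) < n * n: # fewer than n^2/2 deleted
                return False
    return True
```

For example, `check_lemma_d2(4)` examines all 93 placements of at most three tokens on the eight vertices.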


Let us impose an ordering on the sets $I_k$ such that $x_{i_1} < x_{i_2} < \dots < x_{i_n}$ and let $\Gamma = \{I_k \mid y_k \le x_{i_n}\}$ and $\Delta = \{I_k \mid y_k > x_{i_n}\}$.


Theorem D.1: Any matrix-multiplication machine requires $\Omega(n^2)$ storage.

Proof: Since $|\Gamma|+|\Delta| = n$, either $|\Gamma| \ge n/2$ or $|\Delta| \ge n/2$.

Case 1: $|\Delta| \ge n/2$ (see Fig. D.1)

Figure D-1

At $t = x_{i_n}$ all the intervals in $\Delta$ satisfy Lemma D.1. Hence at $t = x_{i_n}$ there are at least $n(n/2)$ tokens in the machine. So the storage required is $\Omega(n^2)$.


Case 2: $|\Gamma| \ge n/2$ (see Fig. D.2)

Figure D-2

At $t = x_{i_n}$, either all $G_k$ such that $I_k \in \Gamma$ have n tokens on them, or at least one of them has less than n tokens. If every $G_k$ has n tokens then the storage required is again $\Omega(n^2)$. If any one, say $G_r$, has less than n tokens then by Lemma D.2 $G_r$ must have released at least $n^2/2$ edges. Now each released edge corresponds to a partially updated $c_{ij}$. None of the $c_{ij}$'s could have left the machine as all of them are finally updated only at $t \ge x_{i_n}$. Thus at any time t ($y_r \le t < x_{i_n}$) there are at least $n^2/2$ partially updated $c_{ij}$'s in the machine. The case $y_r = x_{i_n}$ is covered by assumption 2, which precludes the possibility of all these $c_{ij}$'s being instantaneously updated and leaving the machine. So the storage required for the partially updated $c_{ij}$'s must be $\Omega(n^2)$. □